EFFECTIVE CREDIT RISK MANAGEMENT IN FINANCE USING BIG DATA TECHNOLOGY

This project analyzes the tendency toward credit default in the banking sector using several machine learning algorithms.

The project was run on Virtual Machine Pro 17 with an Ubuntu operating system.

In [87]:
pip install pandas
Requirement already satisfied: pandas in ./lib/python3.10/site-packages (1.5.2)
Requirement already satisfied: python-dateutil>=2.8.1 in ./lib/python3.10/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in ./lib/python3.10/site-packages (from pandas) (2022.7)
Requirement already satisfied: numpy>=1.21.0 in ./lib/python3.10/site-packages (from pandas) (1.24.1)
Requirement already satisfied: six>=1.5 in ./lib/python3.10/site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
In [88]:
pip install numpy
Requirement already satisfied: numpy in ./lib/python3.10/site-packages (1.24.1)
Note: you may need to restart the kernel to use updated packages.
In [89]:
pip install plotly
Collecting plotly
  Downloading plotly-5.12.0-py2.py3-none-any.whl (15.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.2/15.2 MB 2.9 MB/s eta 0:00:00m eta 0:00:01[36m0:00:01
Collecting tenacity>=6.2.0
  Downloading tenacity-8.1.0-py3-none-any.whl (23 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.12.0 tenacity-8.1.0
Note: you may need to restart the kernel to use updated packages.
In [90]:
pip install seaborn
Requirement already satisfied: seaborn in ./lib/python3.10/site-packages (0.12.2)
Requirement already satisfied: pandas>=0.25 in ./lib/python3.10/site-packages (from seaborn) (1.5.2)
Requirement already satisfied: numpy!=1.24.0,>=1.17 in ./lib/python3.10/site-packages (from seaborn) (1.24.1)
Requirement already satisfied: matplotlib!=3.6.1,>=3.1 in ./lib/python3.10/site-packages (from seaborn) (3.6.2)
Requirement already satisfied: cycler>=0.10 in ./lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in ./lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (4.38.0)
Requirement already satisfied: packaging>=20.0 in ./lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (21.3)
Requirement already satisfied: python-dateutil>=2.7 in ./lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (2.8.2)
Requirement already satisfied: pyparsing>=2.2.1 in ./lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (3.0.9)
Requirement already satisfied: contourpy>=1.0.1 in ./lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.0.6)
Requirement already satisfied: pillow>=6.2.0 in ./lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (9.4.0)
Requirement already satisfied: kiwisolver>=1.0.1 in ./lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.4.4)
Requirement already satisfied: pytz>=2020.1 in ./lib/python3.10/site-packages (from pandas>=0.25->seaborn) (2022.7)
Requirement already satisfied: six>=1.5 in ./lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.1->seaborn) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
In [91]:
pip install matplotlib
Requirement already satisfied: matplotlib in ./lib/python3.10/site-packages (3.6.2)
Requirement already satisfied: pyparsing>=2.2.1 in ./lib/python3.10/site-packages (from matplotlib) (3.0.9)
Requirement already satisfied: packaging>=20.0 in ./lib/python3.10/site-packages (from matplotlib) (21.3)
Requirement already satisfied: python-dateutil>=2.7 in ./lib/python3.10/site-packages (from matplotlib) (2.8.2)
Requirement already satisfied: numpy>=1.19 in ./lib/python3.10/site-packages (from matplotlib) (1.24.1)
Requirement already satisfied: cycler>=0.10 in ./lib/python3.10/site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: contourpy>=1.0.1 in ./lib/python3.10/site-packages (from matplotlib) (1.0.6)
Requirement already satisfied: pillow>=6.2.0 in ./lib/python3.10/site-packages (from matplotlib) (9.4.0)
Requirement already satisfied: kiwisolver>=1.0.1 in ./lib/python3.10/site-packages (from matplotlib) (1.4.4)
Requirement already satisfied: fonttools>=4.22.0 in ./lib/python3.10/site-packages (from matplotlib) (4.38.0)
Requirement already satisfied: six>=1.5 in ./lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
In [4]:
pip install scikit.learn
Collecting scikit.learn
  Downloading scikit_learn-1.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.5/9.5 MB 3.3 MB/s eta 0:00:00m eta 0:00:010:01:010m
Requirement already satisfied: numpy>=1.17.3 in ./lib/python3.10/site-packages (from scikit.learn) (1.24.1)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Collecting scipy>=1.3.2
  Downloading scipy-1.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.4/34.4 MB 2.6 MB/s eta 0:00:00m eta 0:00:01[36m0:00:01
Collecting joblib>=1.1.1
  Downloading joblib-1.2.0-py3-none-any.whl (297 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 298.0/298.0 kB 2.1 MB/s eta 0:00:00m eta 0:00:01[36m0:00:01
Installing collected packages: threadpoolctl, scipy, joblib, scikit.learn
Successfully installed joblib-1.2.0 scikit.learn-1.2.0 scipy-1.10.0 threadpoolctl-3.1.0
Note: you may need to restart the kernel to use updated packages.
In [7]:
pip install yellowbrick
Collecting yellowbrick
  Downloading yellowbrick-1.5-py3-none-any.whl (282 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 282.6/282.6 kB 2.5 MB/s eta 0:00:00 MB/s eta 0:00:01:01
Requirement already satisfied: cycler>=0.10.0 in ./lib/python3.10/site-packages (from yellowbrick) (0.11.0)
Requirement already satisfied: scipy>=1.0.0 in ./lib/python3.10/site-packages (from yellowbrick) (1.10.0)
Requirement already satisfied: scikit-learn>=1.0.0 in ./lib/python3.10/site-packages (from yellowbrick) (1.2.0)
Requirement already satisfied: matplotlib!=3.0.0,>=2.0.2 in ./lib/python3.10/site-packages (from yellowbrick) (3.6.2)
Requirement already satisfied: numpy>=1.16.0 in ./lib/python3.10/site-packages (from yellowbrick) (1.24.1)
Requirement already satisfied: kiwisolver>=1.0.1 in ./lib/python3.10/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (1.4.4)
Requirement already satisfied: pyparsing>=2.2.1 in ./lib/python3.10/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in ./lib/python3.10/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (2.8.2)
Requirement already satisfied: packaging>=20.0 in ./lib/python3.10/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (21.3)
Requirement already satisfied: contourpy>=1.0.1 in ./lib/python3.10/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (1.0.6)
Requirement already satisfied: pillow>=6.2.0 in ./lib/python3.10/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (9.4.0)
Requirement already satisfied: fonttools>=4.22.0 in ./lib/python3.10/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (4.38.0)
Requirement already satisfied: joblib>=1.1.1 in ./lib/python3.10/site-packages (from scikit-learn>=1.0.0->yellowbrick) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in ./lib/python3.10/site-packages (from scikit-learn>=1.0.0->yellowbrick) (3.1.0)
Requirement already satisfied: six>=1.5 in ./lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib!=3.0.0,>=2.0.2->yellowbrick) (1.16.0)
Installing collected packages: yellowbrick
Successfully installed yellowbrick-1.5
Note: you may need to restart the kernel to use updated packages.
In [9]:
pip install xgboost
Collecting xgboost
  Downloading xgboost-1.7.3-py3-none-manylinux2014_x86_64.whl (193.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.6/193.6 MB 951.3 kB/s eta 0:00:00m eta 0:00:01[36m0:00:02
Requirement already satisfied: scipy in ./lib/python3.10/site-packages (from xgboost) (1.10.0)
Requirement already satisfied: numpy in ./lib/python3.10/site-packages (from xgboost) (1.24.1)
Installing collected packages: xgboost
Successfully installed xgboost-1.7.3
Note: you may need to restart the kernel to use updated packages.
In [13]:
pip install catboost
Requirement already satisfied: catboost in ./lib/python3.10/site-packages (1.1.1)
Requirement already satisfied: matplotlib in ./lib/python3.10/site-packages (from catboost) (3.6.2)
Requirement already satisfied: pandas>=0.24.0 in ./lib/python3.10/site-packages (from catboost) (1.5.2)
Requirement already satisfied: scipy in ./lib/python3.10/site-packages (from catboost) (1.10.0)
Requirement already satisfied: graphviz in ./lib/python3.10/site-packages (from catboost) (0.20.1)
Requirement already satisfied: plotly in ./lib/python3.10/site-packages (from catboost) (5.12.0)
Requirement already satisfied: six in ./lib/python3.10/site-packages (from catboost) (1.16.0)
Requirement already satisfied: numpy>=1.16.0 in ./lib/python3.10/site-packages (from catboost) (1.24.1)
Requirement already satisfied: python-dateutil>=2.8.1 in ./lib/python3.10/site-packages (from pandas>=0.24.0->catboost) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in ./lib/python3.10/site-packages (from pandas>=0.24.0->catboost) (2022.7)
Requirement already satisfied: packaging>=20.0 in ./lib/python3.10/site-packages (from matplotlib->catboost) (21.3)
Requirement already satisfied: pillow>=6.2.0 in ./lib/python3.10/site-packages (from matplotlib->catboost) (9.4.0)
Requirement already satisfied: fonttools>=4.22.0 in ./lib/python3.10/site-packages (from matplotlib->catboost) (4.38.0)
Requirement already satisfied: kiwisolver>=1.0.1 in ./lib/python3.10/site-packages (from matplotlib->catboost) (1.4.4)
Requirement already satisfied: contourpy>=1.0.1 in ./lib/python3.10/site-packages (from matplotlib->catboost) (1.0.6)
Requirement already satisfied: pyparsing>=2.2.1 in ./lib/python3.10/site-packages (from matplotlib->catboost) (3.0.9)
Requirement already satisfied: cycler>=0.10 in ./lib/python3.10/site-packages (from matplotlib->catboost) (0.11.0)
Requirement already satisfied: tenacity>=6.2.0 in ./lib/python3.10/site-packages (from plotly->catboost) (8.1.0)
Note: you may need to restart the kernel to use updated packages.
In [6]:
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from yellowbrick.classifier import ConfusionMatrix
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

Importing the dataset from a CSV file (the path below is on the local file system of the hdoop user rather than an hdfs:// URI)

In [7]:
credit_pred = pd.read_csv ("/home/hdoop/Downloads/credit_risk_dataset.csv")
credit_pred
Out[7]:
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_status loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
0 22 59000 RENT 123.0 PERSONAL D 35000 16.02 1 0.59 Y 3
1 21 9600 OWN 5.0 EDUCATION B 1000 11.14 0 0.10 N 2
2 25 9600 MORTGAGE 1.0 MEDICAL C 5500 12.87 1 0.57 N 3
3 23 65500 RENT 4.0 MEDICAL C 35000 15.23 1 0.53 N 2
4 24 54400 RENT 8.0 MEDICAL C 35000 14.27 1 0.55 Y 4
... ... ... ... ... ... ... ... ... ... ... ... ...
32576 57 53000 MORTGAGE 1.0 PERSONAL C 5800 13.16 0 0.11 N 30
32577 54 120000 MORTGAGE 4.0 PERSONAL A 17625 7.49 0 0.15 N 19
32578 65 76000 RENT 3.0 HOMEIMPROVEMENT B 35000 10.99 1 0.46 N 28
32579 56 150000 MORTGAGE 5.0 PERSONAL B 15000 11.48 0 0.10 N 26
32580 66 42000 RENT 2.0 MEDICAL B 6475 9.99 0 0.15 N 30

32581 rows × 12 columns

In [8]:
credit_pred.head()
Out[8]:
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_status loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
0 22 59000 RENT 123.0 PERSONAL D 35000 16.02 1 0.59 Y 3
1 21 9600 OWN 5.0 EDUCATION B 1000 11.14 0 0.10 N 2
2 25 9600 MORTGAGE 1.0 MEDICAL C 5500 12.87 1 0.57 N 3
3 23 65500 RENT 4.0 MEDICAL C 35000 15.23 1 0.53 N 2
4 24 54400 RENT 8.0 MEDICAL C 35000 14.27 1 0.55 Y 4

This gives summary information about the whole dataset

In [19]:
credit_pred.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32581 entries, 0 to 32580
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   person_age                  32581 non-null  int64  
 1   person_income               32581 non-null  int64  
 2   person_home_ownership       32581 non-null  object 
 3   person_emp_length           31686 non-null  float64
 4   loan_intent                 32581 non-null  object 
 5   loan_grade                  32581 non-null  object 
 6   loan_amnt                   32581 non-null  int64  
 7   loan_int_rate               29465 non-null  float64
 8   loan_status                 32581 non-null  int64  
 9   loan_percent_income         32581 non-null  float64
 10  cb_person_default_on_file   32581 non-null  object 
 11  cb_person_cred_hist_length  32581 non-null  int64  
dtypes: float64(3), int64(5), object(4)
memory usage: 3.0+ MB

The output above shows that the dataset has some null (empty) values: person_emp_length and loan_int_rate have fewer non-null entries than the 32581 rows

The total number of null values per column needs to be determined so that the predictions are accurate

In [4]:
credit_pred.isnull().sum()
Out[4]:
person_age                       0
person_income                    0
person_home_ownership            0
person_emp_length              895
loan_intent                      0
loan_grade                       0
loan_amnt                        0
loan_int_rate                 3116
loan_status                      0
loan_percent_income              0
cb_person_default_on_file        0
cb_person_cred_hist_length       0
dtype: int64

The "person_emp_length" column has 895 null values and the "loan_int_rate" column has 3116

I drop the rows that contain null values so that they do not affect the results

In [9]:
credit_pred = credit_pred.dropna()
In [10]:
credit_pred.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 28638 entries, 0 to 32580
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   person_age                  28638 non-null  int64  
 1   person_income               28638 non-null  int64  
 2   person_home_ownership       28638 non-null  object 
 3   person_emp_length           28638 non-null  float64
 4   loan_intent                 28638 non-null  object 
 5   loan_grade                  28638 non-null  object 
 6   loan_amnt                   28638 non-null  int64  
 7   loan_int_rate               28638 non-null  float64
 8   loan_status                 28638 non-null  int64  
 9   loan_percent_income         28638 non-null  float64
 10  cb_person_default_on_file   28638 non-null  object 
 11  cb_person_cred_hist_length  28638 non-null  int64  
dtypes: float64(3), int64(5), object(4)
memory usage: 2.8+ MB

The output above shows no null values: every column now has 28638 non-null entries (down from 32581 rows), so missing values can no longer cause errors in the output variable
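
Dropping rows is the approach used in this notebook. As a hedged alternative, the sketch below keeps every row by filling the two affected columns with their medians. It runs on a small invented frame (the values are illustrative only), and median imputation is an assumption of this sketch, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for credit_pred; values are invented for illustration.
toy = pd.DataFrame({
    "person_emp_length": [1.0, np.nan, 4.0, 8.0],
    "loan_int_rate": [11.14, 12.87, np.nan, 14.27],
    "loan_status": [0, 1, 1, 1],
})

# Option used in the notebook: drop every row that has at least one null.
dropped = toy.dropna()
rows_lost = len(toy) - len(dropped)

# Alternative (assumption of this sketch): fill nulls with the column median.
cols = ["person_emp_length", "loan_int_rate"]
imputed = toy.copy()
imputed[cols] = imputed[cols].fillna(imputed[cols].median())

print(rows_lost)
print(imputed.isnull().sum().sum())
```

Whether imputing is preferable to dropping depends on how much data is lost and on whether the missing values are random, so both variants are worth comparing on the real data.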

Using a heatmap to visualize the null values

In [9]:
sns.heatmap(credit_pred.isnull())
Out[9]:
<AxesSubplot: >

Getting more insight from the data by computing, for each numeric column, the count, mean, standard deviation, minimum, quartiles (the 50% row is the median) and maximum

In [10]:
pred = credit_pred.describe()
pred.style.background_gradient (cmap = 'PuBu')
Out[10]:
         person_age   person_income  person_emp_length     loan_amnt  loan_int_rate   loan_status  loan_percent_income  cb_person_cred_hist_length
count  28638.000000    28638.000000       28638.000000  28638.000000   28638.000000  28638.000000         28638.000000                28638.000000
mean      27.727216    66649.371884           4.788672   9656.493121      11.039867      0.216600             0.169488                    5.793736
std        6.310441    62356.447405           4.154627   6329.683361       3.229372      0.411935             0.106393                    4.038483
min       20.000000     4000.000000           0.000000    500.000000       5.420000      0.000000             0.000000                    2.000000
25%       23.000000    39480.000000           2.000000   5000.000000       7.900000      0.000000             0.090000                    3.000000
50%       26.000000    55956.000000           4.000000   8000.000000      10.990000      0.000000             0.150000                    4.000000
75%       30.000000    80000.000000           7.000000  12500.000000      13.480000      0.000000             0.230000                    8.000000
max      144.000000  6000000.000000         123.000000  35000.000000      23.220000      1.000000             0.830000                   30.000000
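
The summary above reports a maximum person_age of 144 and a maximum person_emp_length of 123, and row 0 of the dataset pairs an age of 22 with an employment length of 123. The sketch below shows one possible way to flag such rows. The toy values and the `MAX_AGE` cutoff are assumptions made for illustration; the notebook itself does not filter these rows.

```python
import pandas as pd

# Toy rows; the first mirrors row 0 of the dataset (age 22, employment length 123).
toy = pd.DataFrame({
    "person_age": [22, 21, 144, 30],
    "person_emp_length": [123.0, 5.0, 4.0, 7.0],
})

# Assumed cutoff for illustration only; pick a value that suits the use case.
MAX_AGE = 100

# A row is implausible if the age exceeds the cutoff or if the person
# has supposedly been employed for longer than they have been alive.
implausible = (toy["person_age"] > MAX_AGE) | (toy["person_emp_length"] > toy["person_age"])
cleaned = toy[~implausible]

print(implausible.tolist())
print(len(cleaned))
```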
In [11]:
corr = credit_pred.corr(numeric_only=True)  # restrict to numeric columns; this avoids the FutureWarning
f, ax = plt.subplots(figsize = (25,25))
sns.heatmap(corr, annot= True)
corr.round(0)  # rounding to 0 decimals collapses most coefficients to 0.0 or 1.0
Out[11]:
                            person_age  person_income  person_emp_length  loan_amnt  loan_int_rate  loan_status  loan_percent_income  cb_person_cred_hist_length
person_age                         1.0            0.0                0.0        0.0            0.0         -0.0                 -0.0                         1.0
person_income                      0.0            1.0                0.0        0.0           -0.0         -0.0                 -0.0                         0.0
person_emp_length                  0.0            0.0                1.0        0.0           -0.0         -0.0                 -0.0                         0.0
loan_amnt                          0.0            0.0                0.0        1.0            0.0          0.0                  1.0                         0.0
loan_int_rate                      0.0           -0.0               -0.0        0.0            1.0          0.0                  0.0                         0.0
loan_status                       -0.0           -0.0               -0.0        0.0            0.0          1.0                  0.0                        -0.0
loan_percent_income               -0.0           -0.0               -0.0        1.0            0.0          0.0                  1.0                        -0.0
cb_person_cred_hist_length         1.0            0.0                0.0        0.0            0.0         -0.0                 -0.0                         1.0
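
Because the table above is rounded to 0 decimals, every coefficient appears as 0.0 or 1.0 and the strength of the weaker relationships is lost. The sketch below, on an invented toy frame, keeps two decimals and selects the numeric columns explicitly with `select_dtypes`, which should behave the same across pandas versions. The toy values are assumptions for illustration.

```python
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "loan_amnt": [1000, 5500, 35000, 35000, 6475],
    "person_income": [9600, 9600, 65500, 54400, 42000],
    "loan_grade": ["B", "C", "C", "C", "B"],  # non-numeric, excluded below
})
toy["loan_percent_income"] = toy["loan_amnt"] / toy["person_income"]

# Keep only numeric columns, then round to two decimals instead of zero.
corr = toy.select_dtypes(include="number").corr().round(2)
print(corr)
```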

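Plotting loan_status against loan_percent_income and person_age to see whether income and age affect the tendency to repay

The figures below suggest that the lower the income, the lower the tendency to pay back, and the younger the borrower, the lower the tendency to repay. Note that the first plot uses loan_percent_income (the loan amount as a fraction of income), not income itself.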

In [82]:
plt.figure(figsize = [40,20])
sns.countplot(x= 'loan_percent_income', hue= 'loan_status', data= credit_pred);
In [83]:
plt.figure(figsize = [25,15])
sns.countplot(x= 'person_age', hue= 'loan_status', data= credit_pred);
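
Count plots show absolute numbers, which can be hard to compare when some groups are much larger than others. As a minimal sketch, the share of defaults per group can be computed directly, since the mean of a 0/1 column is a rate. The toy values and the age bands below are assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame; loan_status is 1 for a default and 0 otherwise, as in the notebook.
toy = pd.DataFrame({
    "person_age": [21, 22, 23, 24, 35, 41, 52, 57],
    "loan_status": [0, 1, 1, 1, 0, 0, 1, 0],
})

# Assumed age bands, chosen only for illustration.
bands = pd.cut(toy["person_age"], bins=[19, 25, 40, 100], labels=["20-25", "26-40", "41+"])

# The mean of a 0/1 column is the share of defaulted loans in each band.
default_rate = toy.groupby(bands, observed=True)["loan_status"].mean()
print(default_rate)
```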
In [13]:
defaulter = credit_pred [credit_pred['loan_status'] == 1]
non_defaulter = credit_pred [credit_pred ['loan_status'] == 0]
In [14]:
fig_A = px.histogram(defaulter, x = 'loan_intent', color = 'loan_intent', template = 'plotly_dark')
fig_A.show()
In [15]:
fig_A = px.histogram(defaulter, x = 'person_age', color = 'person_age', template = 'plotly_dark')
fig_A.show()
In [89]:
fig_A = px.histogram(defaulter, x = 'person_income', color = 'person_income', template = 'plotly_dark')
fig_A.show()
In [90]:
fig_A = px.histogram(defaulter, x = 'person_home_ownership', color = 'person_home_ownership', template = 'plotly_dark')
fig_A.show()
In [91]:
fig_A = px.histogram(defaulter, x = 'cb_person_default_on_file', color = 'cb_person_default_on_file', template = 'plotly_dark')
fig_A.show()
In [92]:
fig_A = px.histogram(defaulter, x = 'loan_amnt', color = 'loan_amnt', template = 'plotly_dark')
fig_A.show()
In [93]:
fig_A = px.histogram(non_defaulter, x = 'loan_intent', color = 'loan_intent', template = 'plotly_dark')
fig_A.show()
In [100]:
fig_A = px.histogram(non_defaulter, x = 'person_age', color = 'person_age', template = 'plotly_dark')
fig_A.show()
In [101]:
fig_A = px.histogram(non_defaulter, x = 'person_income', color = 'person_income', template = 'plotly_dark')
fig_A.show()
In [102]:
fig_A = px.histogram(non_defaulter, x = 'person_home_ownership', color = 'person_home_ownership', template = 'plotly_dark')
fig_A.show()
In [103]:
fig_A = px.histogram(non_defaulter, x = 'cb_person_default_on_file', color = 'cb_person_default_on_file', template = 'plotly_dark')
fig_A.show()
In [17]:
fig_A = px.histogram(non_defaulter, x = 'loan_amnt', color = 'loan_amnt', template = 'plotly_dark')
fig_A.show()
In [20]:
grafico = px.scatter_matrix(credit_pred, dimensions=['person_age', 'person_income', 'cb_person_cred_hist_length', 'loan_amnt'], color = 'loan_status')
grafico.show()
In [22]:
# A list keeps the axis order fixed; a set has no guaranteed order
grafico = px.parallel_categories (credit_pred, dimensions = ['loan_intent', 'loan_grade', 'loan_status'])
grafico.show()

Dropping the target (class) variable loan_status from the features

In [11]:
X_pred = credit_pred.drop (columns = ['loan_status'])
X_pred
Out[11]:
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
0 22 59000 RENT 123.0 PERSONAL D 35000 16.02 0.59 Y 3
1 21 9600 OWN 5.0 EDUCATION B 1000 11.14 0.10 N 2
2 25 9600 MORTGAGE 1.0 MEDICAL C 5500 12.87 0.57 N 3
3 23 65500 RENT 4.0 MEDICAL C 35000 15.23 0.53 N 2
4 24 54400 RENT 8.0 MEDICAL C 35000 14.27 0.55 Y 4
... ... ... ... ... ... ... ... ... ... ... ...
32576 57 53000 MORTGAGE 1.0 PERSONAL C 5800 13.16 0.11 N 30
32577 54 120000 MORTGAGE 4.0 PERSONAL A 17625 7.49 0.15 N 19
32578 65 76000 RENT 3.0 HOMEIMPROVEMENT B 35000 10.99 0.46 N 28
32579 56 150000 MORTGAGE 5.0 PERSONAL B 15000 11.48 0.10 N 26
32580 66 42000 RENT 2.0 MEDICAL B 6475 9.99 0.15 N 30

28638 rows × 11 columns

In [12]:
X_pred.values
Out[12]:
array([[22, 59000, 'RENT', ..., 0.59, 'Y', 3],
       [21, 9600, 'OWN', ..., 0.1, 'N', 2],
       [25, 9600, 'MORTGAGE', ..., 0.57, 'N', 3],
       ...,
       [65, 76000, 'RENT', ..., 0.46, 'N', 28],
       [56, 150000, 'MORTGAGE', ..., 0.1, 'N', 26],
       [66, 42000, 'RENT', ..., 0.15, 'N', 30]], dtype=object)
In [13]:
X_pred = X_pred.values
In [14]:
type (X_pred)
Out[14]:
numpy.ndarray
In [15]:
y_pred = credit_pred['loan_status'].values  # select the target by name rather than by position 8
y_pred
Out[15]:
array([1, 0, 1, ..., 1, 0, 0])

Using the Label Encoder

In [16]:
label_encoder_teste = LabelEncoder()
In [17]:
X_pred [4]
Out[17]:
array([24, 54400, 'RENT', 8.0, 'MEDICAL', 'C', 35000, 14.27, 0.55, 'Y', 4],
      dtype=object)
In [18]:
label_encoder_person_home_ownership = LabelEncoder()
label_encoder_loan_intent = LabelEncoder()
label_encoder_loan_grade = LabelEncoder()
label_encoder_cb_person_default_on_file = LabelEncoder()
In [19]:
X_pred [:, 2] = label_encoder_person_home_ownership.fit_transform (X_pred [:,2])
X_pred [:, 4] = label_encoder_loan_intent.fit_transform (X_pred [:,4])
X_pred [:, 5] = label_encoder_loan_grade.fit_transform (X_pred [:,5])
X_pred [:, 9] = label_encoder_cb_person_default_on_file.fit_transform (X_pred [:,9])
In [20]:
X_pred [4]
Out[20]:
array([24, 54400, 3, 8.0, 3, 2, 35000, 14.27, 0.55, 1, 4], dtype=object)
In [21]:
X_pred
Out[21]:
array([[22, 59000, 3, ..., 0.59, 1, 3],
       [21, 9600, 2, ..., 0.1, 0, 2],
       [25, 9600, 0, ..., 0.57, 0, 3],
       ...,
       [65, 76000, 3, ..., 0.46, 0, 28],
       [56, 150000, 0, ..., 0.1, 0, 26],
       [66, 42000, 3, ..., 0.15, 0, 30]], dtype=object)
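
The integer codes above come from LabelEncoder sorting the labels, which is why 'RENT' became 3 and 'MORTGAGE' became 0. The sketch below shows how the mapping can be inspected and reversed. The toy labels, including 'OTHER', are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels; invented for illustration.
home = np.array(["RENT", "OWN", "MORTGAGE", "RENT", "OTHER"])

encoder = LabelEncoder()
codes = encoder.fit_transform(home)

# classes_ is sorted, so the position of each label is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
print(encoder.inverse_transform(codes).tolist())
```

The codes carry an artificial order (0 < 1 < 2 < 3), which is one reason the notebook goes on to one-hot encode these columns.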
In [22]:
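# Note: indices 0 (person_age) and 10 (cb_person_cred_hist_length) are numeric columns, so they are one-hot encoded too
onehotencoder_pred = ColumnTransformer(transformers= [('OneHot', OneHotEncoder(), [0,2,4,5,9,10])], remainder = 'passthrough')
In [23]:
X_pred = onehotencoder_pred.fit_transform(X_pred).toarray()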
In [24]:
X_pred
Out[24]:
array([[0.000e+00, 0.000e+00, 1.000e+00, ..., 3.500e+04, 1.602e+01,
        5.900e-01],
       [0.000e+00, 1.000e+00, 0.000e+00, ..., 1.000e+03, 1.114e+01,
        1.000e-01],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 5.500e+03, 1.287e+01,
        5.700e-01],
       ...,
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 3.500e+04, 1.099e+01,
        4.600e-01],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 1.500e+04, 1.148e+01,
        1.000e-01],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 6.475e+03, 9.990e+00,
        1.500e-01]])
In [25]:
X_pred.shape
Out[25]:
(28638, 110)
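
The encoder in this notebook is applied to column indices [0,2,4,5,9,10], which include the numeric person_age and cb_person_cred_hist_length columns; that is part of why the matrix grows to 110 columns. A possible variant, sketched below on an invented toy frame, encodes only the text columns and selects them by name. The toy values, the `handle_unknown="ignore"` setting and `sparse_threshold=0` are choices of this sketch, not of the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "person_age": [22, 21, 25, 23],
    "person_income": [59000, 9600, 9600, 65500],
    "person_home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
    "cb_person_default_on_file": ["Y", "N", "N", "N"],
})

categorical = ["person_home_ownership", "cb_person_default_on_file"]

encoder = ColumnTransformer(
    transformers=[("OneHot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",  # numeric columns are passed through unchanged
    sparse_threshold=0,       # always return a dense array
)
X_toy = encoder.fit_transform(toy)

# 3 home-ownership columns + 2 default-on-file columns + 2 numeric columns
print(X_toy.shape)
```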

Scaling the feature values

In [26]:
scaler_pred = StandardScaler()
X_pred = scaler_pred.fit_transform(X_pred)
In [27]:
X_pred [0]
Out[27]:
array([-2.21156066e-02, -1.96148135e-01,  2.83796804e+00, -3.67834582e-01,
       -3.50295021e-01, -3.22636605e-01, -2.88538610e-01, -2.65592581e-01,
       -2.45187635e-01, -2.34522752e-01, -2.02305697e-01, -1.91002839e-01,
       -1.75953836e-01, -1.64887684e-01, -1.49368905e-01, -1.41096125e-01,
       -1.30839248e-01, -1.20368793e-01, -1.06470402e-01, -9.75590411e-02,
       -9.11552192e-02, -8.75822726e-02, -7.61246586e-02, -7.05914692e-02,
       -6.37733171e-02, -5.64599460e-02, -5.48821300e-02, -5.15836998e-02,
       -4.76956485e-02, -3.78644533e-02, -3.96712966e-02, -3.39653422e-02,
       -3.39653422e-02, -3.01448110e-02, -2.70892883e-02, -2.50784931e-02,
       -2.21156066e-02, -2.28922276e-02, -2.43714887e-02, -1.32145256e-02,
       -2.13107595e-02, -1.67160753e-02, -1.44760403e-02, -1.02355700e-02,
       -1.56361836e-02, -1.32145256e-02, -1.67160753e-02, -5.90930274e-03,
       -1.32145256e-02, -1.32145256e-02, -8.35716200e-03, -5.90930274e-03,
       -5.90930274e-03, -5.90930274e-03, -5.90930274e-03, -5.90930274e-03,
       -1.02355700e-02, -8.37195816e-01, -5.73860735e-02, -2.87899081e-01,
        9.83926906e-01, -4.35467034e-01, -4.98712041e-01, -3.54552601e-01,
       -4.76161204e-01,  2.20727264e+00, -4.59972905e-01, -6.99121631e-01,
       -6.85270103e-01, -4.98439082e-01,  2.79591098e+00, -1.77005730e-01,
       -8.57417516e-02, -4.54362512e-02, -2.14755511e+00,  2.14755511e+00,
       -4.72349845e-01,  2.11484701e+00, -4.72571017e-01, -2.48055676e-01,
       -2.46704525e-01, -2.49085105e-01, -2.48451997e-01, -2.48372772e-01,
       -2.46066714e-01, -1.20517988e-01, -1.22881884e-01, -1.17041147e-01,
       -1.24337832e-01, -1.14407190e-01, -1.17041147e-01, -1.10749198e-01,
       -2.28922276e-02, -2.50784931e-02, -3.18381379e-02, -2.50784931e-02,
       -2.70892883e-02, -2.57661525e-02, -3.07195863e-02, -2.43714887e-02,
       -2.36434040e-02, -2.50784931e-02, -2.77272550e-02, -1.96023628e-02,
       -2.57661525e-02, -1.22673849e-01,  2.84534330e+01,  4.00398376e+00,
        1.54216384e+00,  3.95252678e+00])
In [28]:
X_pred_train, X_pred_test, y_pred_train, y_pred_test = train_test_split (X_pred, y_pred, test_size =0.30, random_state= 0)
In [29]:
X_pred_train.shape
Out[29]:
(20046, 110)
In [30]:
X_pred_test.shape
Out[30]:
(8592, 110)
In [31]:
X_pred_train.shape , y_pred_train.shape
Out[31]:
((20046, 110), (20046,))
In [32]:
X_pred_test.shape, y_pred_test.shape
Out[32]:
((8592, 110), (8592,))
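
In this notebook the scaler is fitted on the full feature matrix before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training rows only. The synthetic data and the `stratify` argument are assumptions of this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
y_toy = rng.integers(0, 2, size=200)                     # synthetic 0/1 labels

# Split first; stratify keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=0, stratify=y_toy
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.shape, X_test_scaled.shape)
```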
In [33]:
naive_pred = GaussianNB()
naive_pred.fit(X_pred_train, y_pred_train)
predictions = naive_pred.predict (X_pred_test)
predictions
Out[33]:
array([1, 1, 1, ..., 1, 1, 1])
In [34]:
y_pred_test
Out[34]:
array([0, 0, 0, ..., 0, 0, 0])
In [35]:
y_pred_train
Out[35]:
array([0, 0, 0, ..., 0, 0, 0])

Checking the accuracy

In [36]:
accuracy_score (y_pred_test, predictions)
Out[36]:
0.21752793296089384
In [37]:
confusion_matrix (y_pred_test, predictions)
Out[37]:
array([[  67, 6702],
       [  21, 1802]])
In [38]:
con_mat = ConfusionMatrix (naive_pred)
con_mat.fit (X_pred_train, y_pred_train)
con_mat.score (X_pred_test, y_pred_test)
Out[38]:
0.21752793296089384
In [39]:
print (classification_report (y_pred_test, predictions))
              precision    recall  f1-score   support

           0       0.76      0.01      0.02      6769
           1       0.21      0.99      0.35      1823

    accuracy                           0.22      8592
   macro avg       0.49      0.50      0.18      8592
weighted avg       0.64      0.22      0.09      8592
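
The accuracy and the per-class figures above follow directly from the confusion matrix. The short worked example below recomputes them from the four reported counts, assuming the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[67, 6702],
               [21, 1802]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
precision_1 = tp / (tp + fp)  # of loans predicted as default, share that did default
recall_1 = tp / (tp + fn)     # of actual defaults, share that were caught
recall_0 = tn / (tn + fp)     # of actual non-defaults, share predicted correctly

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(recall_0, 2))
```

The model labels almost every loan as a default, which explains the combination of very high recall on class 1 and very low recall on class 0.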

Using the Decision Tree model

In [40]:
from sklearn.tree import DecisionTreeClassifier
pred_tree = DecisionTreeClassifier(criterion = 'entropy')
pred_tree.fit(X_pred_train, y_pred_train)
Out[40]:
DecisionTreeClassifier(criterion='entropy')
In [41]:
prediction = pred_tree.predict (X_pred_test)
prediction
Out[41]:
array([0, 0, 0, ..., 0, 0, 0])
In [42]:
accuracy_score ( y_pred_test, prediction)
Out[42]:
0.8864059590316573
In [43]:
con_mat = ConfusionMatrix (pred_tree)
con_mat.fit (X_pred_train, y_pred_train)
con_mat.score (X_pred_test, y_pred_test)
Out[43]:
0.8864059590316573
In [59]:
print (classification_report (y_pred_test, prediction))
              precision    recall  f1-score   support

           0       0.93      0.92      0.93      6769
           1       0.72      0.75      0.74      1823

    accuracy                           0.89      8592
   macro avg       0.83      0.84      0.83      8592
weighted avg       0.89      0.89      0.89      8592

In [45]:
from sklearn.ensemble import RandomForestClassifier
In [52]:
random_forest_pred = RandomForestClassifier(n_estimators=40, criterion='entropy', random_state = 0)
random_forest_pred.fit(X_pred_train, y_pred_train)
Out[52]:
RandomForestClassifier(criterion='entropy', n_estimators=40, random_state=0)
In [53]:
predictions = random_forest_pred.predict(X_pred_test)  # use the model fitted above (was random_forest_credit, which is undefined)
predictions
Out[53]:
array([0, 0, 0, ..., 0, 0, 0])
In [54]:
y_pred_test
Out[54]:
array([0, 0, 0, ..., 0, 0, 0])
In [55]:
accuracy_score(y_pred_test, predictions)
Out[55]:
0.93237895716946
In [57]:
cm = ConfusionMatrix(random_forest_pred)
cm.fit(X_pred_train, y_pred_train)
cm.score(X_pred_test, y_pred_test)
Out[57]:
0.93237895716946
In [58]:
print(classification_report(y_pred_test, predictions))
              precision    recall  f1-score   support

           0       0.93      0.99      0.96      6769
           1       0.97      0.71      0.82      1823

    accuracy                           0.93      8592
   macro avg       0.95      0.85      0.89      8592
weighted avg       0.93      0.93      0.93      8592

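Comparing the three machine learning algorithms on this dataset, Random Forest performs best with about 93% accuracy, ahead of the Decision Tree (about 89%) and Naive Bayes (about 22%). On the default class (1), Random Forest has the highest precision (0.97), while the Decision Tree has slightly higher recall (0.75 versus 0.71).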
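
A comparison like this can be made less error-prone by fitting all models in one loop on the same split. The sketch below does so on synthetic data generated with `make_classification`; the data, the 80/20 class weights and the choice to report recall on class 1 next to accuracy are assumptions of this sketch, and its scores say nothing about the credit dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the credit data (about 20% positives).
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=0
)

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=40, criterion="entropy", random_state=0),
}

# Fit every model on the same training rows and score it on the same test rows.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, predicted),
        "recall_default": recall_score(y_test, predicted, pos_label=1),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.2f}, recall(1)={scores['recall_default']:.2f}")
```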
